In the modern age of social media, partisan politics, and abundant false information, internet users must exercise great mindfulness in their online media consumption. In this paper, we address this issue by building models to classify online news articles by category based on their headlines. We hope that these techniques may be used by individuals, organizations, or policy makers to identify trends in how media influence affects individuals or populations, as well as the genres of news that these individuals or populations tend to consume. Although users can never be too vigilant against misinformation, our model can assist in efficiently finding relevant and truthful information. We train models using logistic regression, boosting, random forests, and neural networks. The best model classifies by category with 49% accuracy, 34 percentage points better than the baseline model.
The United States is becoming increasingly dependent on the internet for news. In the 2018 General Social Survey (GSS), 45.9% of respondents report the internet as their primary source of news, 7.6 percentage points more than television and newspaper combined. These data are shown in Table 1.1.
| Internet | Newspaper | Television | Other |
|---|---|---|---|
| 29.4 | 14.4 | 44.7 | 11.5 |
This result is consistent with the increasing reliance on the internet for news over the past 10 years of GSS data, as shown in Figure 1.1, and it represents a dramatic change in the way Americans access information. The digital media company eMarketer reports that 2019 is the first year in which the rate of adult media consumption via the internet has surpassed the rate of media consumption via television[2].
Figure 1.1: Media Consumption by Year
These trends are among those who seek out the news; we might, however, expect the breadth of online media influence to far surpass these reported percentages. According to the DataReportal Digital 2019 Report[3], 95% of Americans use the internet, and 70% are active at least monthly on social media. As social media increasingly becomes a medium for advertising, this supermajority of the US population is constantly exposed to the media’s digital influence. While this may not inherently be negative, social media has been consumed by an epidemic of misinformation. A common example of this was false and targeted advertising in the 2016 Presidential election.
DataReportal[4] estimates that at the onset of 2017, there were 214 million active Facebook users in the United States. Among these users, Yonder[5] reports 76.5 million engagements with deftly targeted, polarizing advertisements before the November 2016 election. This targeted advertising in turn affects the websites users are directed towards and the forms of news they see.
However, this phenomenon of polarization extends beyond politically motivated advertisements into all of digital news reporting, through what has become known as clickbait. Indeed, Reis et al.[6] find that:
Interestingly, our results suggest that a headline has more chance to be successful if the sentiment expressed in its text is extreme, towards the positive or the negative side. Results suggest that neutral headlines are usually less attractive.
‘Clickbait’ is a method of headlining that strategically withholds information from the title of an article to make the article seem more enticing, born out of necessity due to the plethora of information available on the web. While clickbait can be as atrocious as titles such as “Is your boyfriend cheating on you?…He is, if he does these five things”[7], many examples are much more subtle, such as “‘Shark Tank’ star Barbara Corcoran, AKA NY’s Queen of Real Estate, reveals a brilliant tip for paying off mortgage”[8]. This form of digital media is not necessarily negative by design, but it is inextricably linked with extreme headlines and political polarization.
This incentive for extremes in news headlining results in a dilution of the news that is actually relevant to an individual. Our model can be applied to several very relevant questions. First, we can use it to compare the consumption of media by genre before and after the advent of clickbait and widespread polarizing media: is the distribution over the news that people consume the same or different? It can also be used at an individual scale to determine the distribution of an individual's media consumption over categories. This can help ensure that one is accessing the information relevant to them and that nothing falls by the wayside. It can likewise be applied to different subpopulations, to see how the distribution of media consumed varies by subpopulation. Further discussion of this is included in Section 4.2.
We are using a dataset of roughly 200,000 HuffPost headlines from 2012-2018, labeled by category[9]. We find this dataset appropriate for several reasons. As stated above, we are particularly interested in classifying internet news headlines, as well as evaluating for bias. HuffPost is a major provider of internet news. Additionally, in a 2017 Gallup and Knight Foundation poll[10], Americans rated HuffPost the fourth most biased of 16 popular media sources, including The New York Times, NPR, and Fox News. HuffPost also reports on a very wide variety of topics, giving us the opportunity to train a reasonably robust classification model.
The data contains 200,853 observations. Each observation is stored as a tuple with the label, headline, author, link, and date. There are 100 different authors and 41 different categories. A list of the original labels and the number of observations given each label is provided in Table 1.2.
| Category | Number |
|---|---|
| ARTS | 1509 |
| ARTS & CULTURE | 1339 |
| BLACK VOICES | 4528 |
| BUSINESS | 5937 |
| COLLEGE | 1144 |
| COMEDY | 5175 |
| CRIME | 3405 |
| CULTURE & ARTS | 1030 |
| DIVORCE | 3426 |
| EDUCATION | 1004 |
| ENTERTAINMENT | 16058 |
| ENVIRONMENT | 1323 |
| FIFTY | 1401 |
| FOOD & DRINK | 6226 |
| GOOD NEWS | 1398 |
| GREEN | 2622 |
| HEALTHY LIVING | 6694 |
| HOME & LIVING | 4195 |
| IMPACT | 3459 |
| LATINO VOICES | 1129 |
| MEDIA | 2815 |
| MONEY | 1707 |
| PARENTING | 8677 |
| PARENTS | 3955 |
| POLITICS | 32739 |
| QUEER VOICES | 6314 |
| RELIGION | 2556 |
| SCIENCE | 2178 |
| SPORTS | 4884 |
| STYLE | 2254 |
| STYLE & BEAUTY | 9649 |
| TASTE | 2096 |
| TECH | 2082 |
| THE WORLDPOST | 3664 |
| TRAVEL | 9887 |
| WEDDINGS | 3651 |
| WEIRD NEWS | 2670 |
| WELLNESS | 17827 |
| WOMEN | 3490 |
| WORLD NEWS | 2177 |
| WORLDPOST | 2579 |
In this section, we will first discuss ways in which we are modifying the collected data to be appropriate for this study, and then we will look for relationships in the data to motivate our analysis.
A. Investigating Missing Data
We first look to see what data is missing. We provide a summary of the missing data in Table 2.1.
| Variable | Num. Missing |
|---|---|
| category | 0 |
| headline | 0 |
| authors | 36620 |
| link | 0 |
| short_description | 19708 |
| date | 0 |
We then used the links included in the data set to inspect the articles with missing authors or short descriptions and found that the data set is not missing this information; rather, certain articles simply do not publish a short description or the author's name. We therefore decided to create new indicator variables for missing authors and descriptions. This is discussed in more detail in Section 2.1 B.
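As an illustration, the indicator variables can be constructed as follows. This is a minimal sketch in Python with pandas (the column names match the dataset, but the toy rows are invented); the original analysis was carried out in R.

```python
import pandas as pd

# Toy stand-in for the headline dataset; in the real data, missing
# authors/descriptions show up as empty strings or NaN.
df = pd.DataFrame({
    "headline": ["A", "B", "C"],
    "authors": ["Jane Doe", "", None],
    "short_description": ["desc", None, "desc"],
})

# Indicator variables: 1 if the field is absent, 0 otherwise.
df["missing_author"] = (df["authors"].isna() | (df["authors"] == "")).astype(int)
df["missing_desc"] = df["short_description"].isna().astype(int)

print(df[["missing_author", "missing_desc"]].values.tolist())
```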
B. Changing Labels
Notice from Table 1.2 that several categories seem very similar. For this reason, we combine labels using a mapping shown in the appendix and end up with 26 unique labels. We show the distribution over the modified categories in Figure 2.1, and a table of the new labels is provided below.
| Modified Label |
|---|
| ARTS & CULTURE |
| BUSINESS |
| COMEDY |
| CRIME |
| DIVERSITY |
| EDUCATION |
| ENTERTAINMENT |
| ENVIRONMENT |
| FIFTY |
| FOOD & DRINK |
| HEALTHY LIVING |
| HOME & LIVING |
| IMPACT |
| MARRIAGE |
| MEDIA |
| PARENTING |
| POLITICS |
| RELIGION |
| SCIENCE |
| SPORTS |
| STYLE & BEAUTY |
| TECH |
| TRAVEL |
| WEIRD NEWS |
| WOMEN |
| WORLD NEWS |
Figure 2.1: Pct of Observations Per Modified Category
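The relabeling reduces to a dictionary lookup. The sketch below is illustrative only: it shows a handful of plausible merges of the near-duplicate labels from Table 1.2, not the paper's full appendix mapping.

```python
# Illustrative subset of the label mapping (NOT the full appendix
# mapping): collapse near-duplicate labels; anything not listed
# keeps its original label.
LABEL_MAP = {
    "ARTS": "ARTS & CULTURE",
    "CULTURE & ARTS": "ARTS & CULTURE",
    "PARENTS": "PARENTING",
    "STYLE": "STYLE & BEAUTY",
    "THE WORLDPOST": "WORLD NEWS",
    "WORLDPOST": "WORLD NEWS",
    "WEDDINGS": "MARRIAGE",
    "DIVORCE": "MARRIAGE",
}

def relabel(category: str) -> str:
    return LABEL_MAP.get(category, category)

print(relabel("WORLDPOST"))  # WORLD NEWS
print(relabel("POLITICS"))   # POLITICS (unchanged)
```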
For the headlines and descriptions separately, we first remove all non-English words, punctuation, and numbers, stem the remaining words, collapse all whitespace to single spaces, and save the results in their own corpora. We then remove terms that appear in fewer than 1% of headlines and descriptions, respectively. Summaries of the document-term matrices with term-frequency weighting are provided in Table 2.3.
| Feature | Number of Entries | Pct. Non-Sparse |
|---|---|---|
| Headline | 35793801 | 1.003 |
| Description | 79265674 | 1.101 |
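A minimal sketch of the cleaning step, in Python rather than the R text-mining tools actually used; the stop list here is a tiny invented placeholder, and stemming (e.g. with a Porter stemmer) is omitted for brevity.

```python
import re

STOP = {"the", "a", "an", "of", "to", "in"}  # tiny illustrative stop list

def clean(text: str) -> str:
    """Lowercase, drop punctuation/digits, collapse whitespace.
    (The paper additionally stems each remaining word.)"""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and numbers
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP]
    return " ".join(tokens)

print(clean("Trump's 2019 Budget: What to Watch!"))  # trump budget what watch
```

Dropping terms that appear in fewer than 1% of documents then corresponds to, e.g., scikit-learn's `CountVectorizer(min_df=0.01)` or tm's `removeSparseTerms`.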
After preparing the predictor words from text mining, we also prepare additional meta-information to include in our analysis. In total, we introduce seven new variables in order to improve classification, including the publication weekday, the average word lengths of the headline and description, and indicators for a missing author or description.
We provide several plots to see if any of our added predictors may be significant. First, we look at the distribution of publications per weekday in Figure 2.2. We see that the distribution is roughly uniform.
Figure 2.2: Prop of Reviews Per Weekday
We next explore the distribution of labels over weekdays, shown in Figure 2.3.
Figure 2.3: Association Plot
We see that the strongest correlation is 0.27, between SCIENCE and Saturday. The two weakest correlations are 0.03, between FOOD & DRINK and Sunday and between MARRIAGE and Sunday. Generally speaking, there are considerable differences between classes in their most common publishing days. Some categories, such as ENVIRONMENT and SCIENCE, are much more frequently published on weekends, while others, such as FOOD & DRINK and MARRIAGE, appear significantly less often on weekends. This indicates that the publishing date might be a good predictor of category.
For our analyses, we combine all the aforementioned variables and then randomly split the data into training, validation, and testing sets, following an 80/10/10 percent split.
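The split can be sketched as follows; this is a plain-Python illustration, as the actual seed and implementation are unspecified above.

```python
import random

def train_val_test_split(rows, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle rows and split them 80/10/10 into train/val/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

# The dataset has 200,853 observations (indices stand in for rows here).
train, val, test = train_val_test_split(range(200_853))
print(len(train), len(val), len(test))  # 160682 20085 20086
```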
We used three methods for our model building: random forests, gradient boosting, and neural networks.
We want some notion of how well our classifiers are doing. Notice from Figure 2.1 that the most popular category is POLITICS. At the very least, we would like our classifiers to do better than the most trivial approach of predicting the most popular option every time. In our test data, 16% of the labels are POLITICS. Hence, we should expect our classifiers to perform significantly better than this.
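This baseline is simply the majority-class classifier; a small sketch with invented toy labels:

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

# Toy example: POLITICS dominates the training labels, as in the data.
train = ["POLITICS"] * 5 + ["SPORTS"] * 2
test = ["POLITICS", "SPORTS", "POLITICS", "TECH"]
print(majority_baseline_accuracy(train, test))  # 0.5
```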
A. Model Description
We next analyze this data set using random forests. A random forest is an ensemble method that predicts classes based on the predictions of often hundreds of decision trees. The individual trees are grown very deep to learn highly irregular patterns; they overfit their training sets, meaning they have low bias but very high variance. Random forests average many such deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in bias and some loss of interpretability, but it generally boosts the performance of the final model greatly and provides reliable estimates of feature importance.
We run the model with a sufficiently large number of trees (200) and the standard number of variables considered at each split point, \(\sqrt{p}\), where p is the number of features. We built the random forest using the ranger package, with all variables available for induction and 24 variables sampled at each split. We also calculate the importance of the predictors by running ranger in supervised mode. We used the ranger package for its speed advantages and advanced output options, such as variable importance. The top 20 most important variables can be seen in Table 3.1.
| Variable | Importance |
|---|---|
| trump | 1960.5825 |
| weekdays | 1585.9131 |
| photo | 1453.5733 |
| avg_desc | 1279.5745 |
| avg_headline | 1265.6531 |
| recip | 981.6664 |
| trump.1 | 817.6174 |
| wed | 804.7043 |
| divorc | 761.1033 |
| gay | 744.7377 |
| missing_author | 611.4749 |
| gop | 592.9337 |
| parent | 577.1738 |
| travel | 564.1825 |
| fashion | 506.7345 |
| wed.1 | 503.1471 |
| divorc.1 | 486.3447 |
| travel.1 | 481.8247 |
| missing_desc | 477.2228 |
| kid | 464.9667 |
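The original model was fit with R's ranger package; an analogous sketch using scikit-learn on synthetic data (all sizes here are invented stand-ins for the document-term and meta-feature matrix) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the document-term + meta-feature matrix.
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)

# 200 trees, sqrt(p) features considered at each split, as in the paper.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)

# Per-feature importances, analogous to ranger's importance output.
top = sorted(enumerate(rf.feature_importances_),
             key=lambda kv: kv[1], reverse=True)[:5]
print(top)  # the five most important (feature index, importance) pairs
```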
B. Results
We trained this model on the training dataset and evaluated it on the testing data, achieving a test accuracy of approximately 40%. This is a decent performance compared to the baseline accuracy of 16%.
The “Confusion Plot” in Figure 3.1 shows an interesting relation: the distribution of predictions given a label. The vertical axis is the predicted labels, and the horizontal axis is the true labels. The shading and inscribed values represent the percentage of the time that a given label was predicted given a true label. For example, the 0.02 where SPORTS on the vertical axis intersects COMEDY on the horizontal axis means that SPORTS was predicted 2% of the time that the actual label was COMEDY.
Ideally, every label is properly predicted, which would result in a plot where the intersection of every label with itself is inscribed with a 1 (dark blue) and every other cell with a 0 (dark red). We can see that this is not the case for most categories. We do, however, notice that the model tends to predict the correct label more often than not, evidenced by the fact that the values along the main diagonal tend to be darker than the rest of the values in their columns.
We can also use this plot to see which categories tend to be confused with others. For instance, we see that POLITICS, ENTERTAINMENT, and HEALTHY LIVING are often false positives. For example, the true label ARTS & CULTURE is more likely to be predicted as POLITICS or HEALTHY LIVING than to be correctly classified. This can be explained by closeness in content as well as class imbalance. This is especially true for extreme cases such as FIFTY, which is very close in content to many categories but rarely appears in the training set, leading to a nearly 100% rate of miscategorization.
Figure 3.1: Confusion Plot
A. Model Formulation
Boosting is a very potent tool for error reduction in decision trees. The main difference compared to classical random forests is that boosting pools together many shallow decision trees (weak learners) instead of deep ones; these have low variance but high bias. To reduce the overall error, boosting then reduces the bias by sequentially adding new trees that perform better on the cases where the previous ones performed poorly. Gradient boosting specifically utilises the gradient of a loss function, e.g. the log loss, to find the next tree to add. Boosting is especially advantageous in unbalanced scenarios, which is why we consider it for this project. We implemented a model using extreme gradient boosting from the xgboost package for its speed advantage and relatively high accuracy on large and complex data. We then cross-validate the hyperparameters: the learning rate (eta), the maximum tree depth (complexity of the trees), and the number of iterations (trees).
B. Results:
We achieved an accuracy of 47%. We cross-validated our model parameters using xgb.cv with 10 folds and found the best results with a learning rate (eta) of 0.3, 100 iterations, and a maximum tree depth of 3. We use the multinomial log loss as the performance metric and multi:softprob as the output objective, providing the probability of each class.
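The original model used the xgboost package; a roughly analogous sketch with scikit-learn's gradient boosting, the selected hyperparameters, and invented synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; sizes are invented for illustration.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Hyperparameters the paper selected via xgb.cv:
# eta (learning rate) = 0.3, max depth = 3, 100 boosting rounds.
gb = GradientBoostingClassifier(learning_rate=0.3, max_depth=3,
                                n_estimators=100, random_state=0).fit(X, y)

# Like xgboost's multi:softprob objective, predict_proba returns
# one probability per class.
proba = gb.predict_proba(X[:1])
print(proba.shape)  # (1, 3)
```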
A. Model Formulation
We now train a neural net to compare to the random forest. Neural networks (NNs) were developed in the mid-1900s as an attempt to create a simple model of the human brain. They are structured as directed graphs, where nodes are referred to as ‘neurons’ and each edge stores the weight that the parent’s value has in determining the child’s value. NNs are advantageous for several reasons. They are capable of learning complex, non-linear relationships in data with very few assumptions; in particular, there are no distributional assumptions, and the model allows for heteroskedasticity.
Neural nets are able to learn very complex systems effectively, but at the expense of interpretability. We decided to implement a neural network because we have a wide range of predictors, from text mining as well as the meta-information, on a relatively large data set. Neural networks provide relatively fast and accurate models in scenarios where we would otherwise have problems with feature selection due to memory constraints. For instance, the multinomial logistic regression takes significantly longer and has much higher memory demands. We therefore trade interpretability for better performance and speed.
We have set up a neural network with 512 input neurons, 256 hidden neurons, and 26 output neurons. The input and hidden layers use the ReLU activation function, whereas the output layer uses the softmax function, so the output gives us the probability of each possible category. We fine-tune this model by introducing dropout to improve the generalisability of the network. The dropout rate determines what fraction of randomly chosen neurons in each layer is ignored during training; this reduces the importance of specific neurons and thus helps against overfitting, and we expect an improvement in test error from it. The dropout rate was eventually set to 0.2.
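The forward pass of the described architecture can be sketched directly in NumPy. The random weights are for illustration only; the actual network was trained, and its framework is not specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate=0.2):
    """Training-time dropout: zero a random ~20% of activations."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Architecture from the paper: 512 -> 256 -> 26, ReLU then softmax.
W1 = rng.normal(0, 0.05, (512, 256)); b1 = np.zeros(256)
W2 = rng.normal(0, 0.05, (256, 26));  b2 = np.zeros(26)

x = rng.normal(size=(1, 512))     # one headline's feature vector
h = dropout(relu(x @ W1 + b1))    # hidden layer with dropout
p = softmax(h @ W2 + b2)          # probability of each of the 26 categories

print(p.shape, round(float(p.sum()), 6))  # (1, 26) 1.0
```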
B. Results:
We found that the neural network with dropout provides the best accuracy yet, at 49%. We provide a plot of the loss and accuracy of the neural net per epoch below.
A. Model Formulation
Lastly, we used regression mainly to check for biases regarding the meta-information in the presentation of news.
We investigate the regression coefficients regarding the meta information such as day of publishing, average word length or word count to check for differences in the respective categories. We used a multinomial grouped LASSO to determine the variables of importance for most categories. We then created our model using the multinom function from the nnet package for its speed and performance advantages.
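An analogous multinomial logistic regression sketch in Python (the original used the multinom function from R's nnet package; the data here are synthetic stand-ins for the seven meta-information features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the meta-information features
# (weekday, average word lengths, missing-value flags, ...).
X, y = make_classification(n_samples=300, n_features=7, n_informative=5,
                           n_classes=3, random_state=0)

# Multinomial logistic regression, analogous to nnet::multinom.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per (class, predictor) pair, as in the association plot.
print(clf.coef_.shape)  # (3, 7)
```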
B. Results
Surprisingly, even meta-information alone achieves a test accuracy of around 20%. The differences in regression coefficients are shown in Figure ??.
The association plot shows the regression coefficients for each category and predictor. We use this to spot irregularly strong associations between predictors and categories. For instance, the plot shows a highly negative coefficient between HOME & LIVING and the missing-description indicator, meaning that this category is strongly associated with having a description. At the same time, RELIGION is strongly associated with a missing description. This could be because sensitive topics are less likely to receive a short, catchy description, whereas topics about home, marriage, and the like often feature companies and organisations discussing light-hearted subjects. We also see differences regarding missing author names: categories such as CRIME, ENVIRONMENT, and WORLD NEWS have strongly positive associations, while EDUCATION and IMPACT have strongly negative ones. This means, for instance, that articles about crime are more likely not to feature the author's name, while articles about education are more likely to feature it. This might be explained by some topics being of a very sensitive nature, leading journalists to opt not to publish their names, while other articles might provide desired publicity.
We found the best performing model, by some margin, to be the neural network with a dropout rate of 0.2, followed by gradient boosting with a learning rate of 0.3. Interestingly, categories such as MARRIAGE were especially affected by model choice and were predicted significantly better by the neural network and gradient boosting.
We constructed a model that accurately classifies online articles based on their headlines, with a testing accuracy of nearly 50%, 34 percentage points better than the baseline classifier. We also notice that the model often assigns two or three categories nearly identical likelihoods, after which the probabilities drop off sharply, and that these categories tend to be very similar (e.g. FOOD & DRINK, HEALTHY LIVING, and HOME & LIVING). For this reason, we also check whether the true label is among the top three predicted categories. Under this notion of accuracy, our model predicts correctly up to 70% of the time. We now discuss limitations and future directions.
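This top-3 notion of accuracy is computed directly from the model's class-probability output; a small sketch with invented probabilities:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of rows whose true label is among the k
    highest-probability classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(labels, topk)]))

# Invented probabilities over four toy classes.
probs = np.array([[0.50, 0.30, 0.10, 0.10],
                  [0.10, 0.20, 0.30, 0.40],
                  [0.25, 0.25, 0.25, 0.25]])
labels = [1, 0, 3]
print(top_k_accuracy(probs, labels, k=3))
```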
There are several limitations to our experiment. For example, we do not have access to information regarding the demographics of the individuals reading the articles in our data. This would be extremely useful in extending our study beyond classification into an analysis of how media consumption varies within groups. We also notice that in the original data, several categories were very similar to each other, and we found no formalization of these categories on HuffPost's website. For this reason, we believe the data was labeled by hand without a formal scheme, and hence the accuracy of the model is restricted by the biases and opinions of the individual(s) who labeled the data. We also do not have information on the political sway of the articles in this dataset, which would allow us to analyze for bias as well.
Regarding model creation, we used only computationally inexpensive tuning of our hyperparameters. With a more extensive hyperparameter search we could improve predictive power even further. The regression approach especially could benefit from this, by including more predictors and then running a thorough feature selection process.
In the future, we would like to see these formulations applied to a broader dataset, involving longitudinal data with demographic information on articles read by individuals over time. This would allow us to see trends in how individuals consume data, such as whether they believe they consume data in a more uniform way than they do. We also would be interested in looking at trends over subpopulations. For example, how does the distribution over categories differ between the Southeast and Northwest?
Lastly, we would like to see similar techniques applied to articles labeled for political bias also. With this data, we could look at relationships between bias and consumption. This would be especially important at the times of elections, as it would allow us to look for changes in media consumption and bias in the most politically important and volatile times.
1. Geiger, A. (2019, September). Key findings about the online news landscape in America. Pew Research Center. Retrieved from https://www.pewresearch.org/fact-tank/2019/09/11/key-findings-about-the-online-news-landscape-in-america/
2. US time spent with media 2019. (2019). eMarketer. Retrieved from https://www.emarketer.com/content/us-time-spent-with-media-2019
3. Kemp, S. (2019, January). Digital 2019: The United States of America. DataReportal Global Digital Insights. Retrieved from https://datareportal.com/reports/digital-2019-united-states-of-america
4. Kemp, S. (2019, January). Digital 2017: The United States of America. DataReportal Global Digital Insights. Retrieved from https://datareportal.com/reports/digital-2017-united-states-of-america
5. (2019). Yonder. Retrieved from https://cdn2.hubspot.net/hubfs/4326998/ira-report-rebrand_FinalJ14.pdf
6. Dos Reis, J. C. S., Souza, F. B. de, Melo, P. O. S. V. de, Prates, R. O., Kwak, H., & An, J. (2015). Breaking the news: First impressions matter on online news. In Ninth International AAAI Conference on Web and Social Media.
7. Bhattarai, A. (2018, October). You won’t believe how these 9 shocking clickbaits work! (Number 8 is a killer!). Medium: The Zerone. Retrieved from https://medium.com/zerone-magazine/you-wont-believe-how-these-9-shocking-clickbaits-work-number-8-is-a-killer-4cb2ceded8b6
8. Smith, B. (2016, July). 10 irresistible clickbait facebook advertising examples. AdEspresso. Retrieved from https://adespresso.com/blog/clickbait-facebook-advertising-examples/
9. Misra, R. (2018, December). News category dataset. Kaggle. Retrieved from https://www.kaggle.com/rmisra/news-category-dataset
10. Perceived accuracy and bias in the news media. (n.d.). Knight Foundation. Retrieved from https://knightfoundation.org/reports/perceived-accuracy-and-bias-in-the-news-media/